quantization strategy
MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling
Zhang, Yu, Zhen, Hui-Ling, Yuan, Mingxuan, Yu, Bei
Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. This online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, reaching performance comparable to the BF16 baseline while delivering up to 34% higher training throughput. Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, including reasoning, language understanding, and generation (Achiam et al., 2023; Grattafiori et al., 2024; Liu et al., 2024; Adler et al., 2024).
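The two-level idea in the MOSS abstract (one high-precision global scale plus a compact power-of-two scale per group) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the group size of 4, and the rule for choosing the exponent are assumptions made for illustration, and FP8 rounding itself is not simulated. The only fixed fact used is that the largest finite magnitude in the FP8 E4M3 format is 448.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def two_level_scales(values, group_size=4):
    """Return (global_scale, exponents): one float scale for the whole
    tensor and one integer exponent e per group (local scale = 2**e)."""
    amax = max(abs(v) for v in values) or 1.0
    global_scale = amax / FP8_E4M3_MAX  # kept in high precision
    exponents = []
    for start in range(0, len(values), group_size):
        gmax = max(abs(v) for v in values[start:start + group_size])
        if gmax == 0.0:
            exponents.append(0)
        else:
            # Smallest power of two that is >= gmax / amax, so the group
            # still fits in the FP8 range after scaling (assumed rule).
            exponents.append(math.ceil(math.log2(gmax / amax)))
    return global_scale, exponents


def apply_scales(values, global_scale, exponents, group_size=4):
    """Divide each value by global_scale * 2**e of its group."""
    return [
        v / (global_scale * 2.0 ** exponents[i // group_size])
        for i, v in enumerate(values)
    ]
```

Because each local scale is a power of two, it can be stored as a small integer exponent, and undoing it is an exponent shift rather than a full multiplication. Only the single global scale needs a high-precision multiply.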
Layer-Wise High-Impact Parameter Ratio Optimization in Post-Training Quantization for Large Language Models
Pham, Cuong, Dung, Hoang Anh, Nguyen, Cuong C., Le, Trung, Carneiro, Gustavo, Do, Thanh-Toan
Large language models (LLMs) have significantly advanced natural language processing, but their massive parameter counts create substantial computational and memory challenges during deployment. Post-training quantization (PTQ) has emerged as a promising approach to mitigate these challenges with minimal overhead. While existing PTQ methods can effectively quantize LLMs, they experience substantial accuracy loss at extremely low bit-widths, primarily due to high-impact parameters that significantly influence quantization performance. Several approaches address these issues by identifying and retaining the high-impact parameters in FP16 format. However, they apply fixed ratios of high-impact parameters across all layers, overlooking layer-wise sensitivity variations. In this paper, we propose a quadratic optimization framework that determines layer-specific ratios of high-impact parameters while considering inter-layer dependencies. We quantize high-impact parameters to moderate bit-widths, which often results in negligible performance degradation in quantized LLMs, while the remaining parameters can be quantized to extremely low bit-widths. Under the same resource-constrained budget, this allows for preserving more high-impact parameters than methods that keep only a few of them in FP16 format. Additionally, the proposed framework allows us to apply an advanced quantization method, which often requires extensive learnable parameters, solely to the high-impact parameters, while applying a computationally efficient method to the rest. Our approach achieves an effective balance between computational efficiency and model accuracy while maintaining high performance compared to state-of-the-art methods.
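The budget argument in this abstract is simple arithmetic, sketched below. The specific bit-widths (2-bit for the bulk, 4-bit as the "moderate" width) and the 2.14-bit average budget are assumptions chosen for illustration, and the sketch ignores any storage needed to index which parameters are high-impact.

```python
def high_impact_ratio(avg_bits, low_bits, high_bits):
    """Fraction of parameters that can be stored at high_bits when the
    rest use low_bits and the average must equal avg_bits."""
    if not low_bits <= avg_bits <= high_bits:
        raise ValueError("avg_bits must lie between low_bits and high_bits")
    return (avg_bits - low_bits) / (high_bits - low_bits)


# Same hypothetical budget of 2.14 bits per parameter on average:
ratio_fp16 = high_impact_ratio(2.14, low_bits=2, high_bits=16)  # about 1%
ratio_4bit = high_impact_ratio(2.14, low_bits=2, high_bits=4)   # about 7%
```

Under these assumed numbers, storing high-impact parameters at 4 bits instead of FP16 lets roughly seven times as many of them be protected for the same average bit-width.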
InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models
Wang, Wenjun, Cai, Shuo, Xie, Congkai, Feng, Mingfa, Zhang, Yiming, Li, Zhen, Yang, Kejing, Li, Ming, Cao, Jiannong, Yang, Hongxia
The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration
Liu, Tianqi, Fu, Kairui, Zhang, Shengyu, Fan, Wenyan, Du, Zhaocheng, Zhu, Jieming, Wu, Fan, Wu, Fei
With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for Customizing Hybrid-precision On-device model for sequential Recommendation with Device-cloud collaboration (CHORD), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategy. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
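The 2-bits-per-channel encoding mentioned in the CHORD abstract can be pictured as packing four per-channel strategy codes into each byte. The sketch below is an assumed illustration of that packing, not CHORD's wire format: the function names are hypothetical, and what each code value (0 to 3) means, for example which bit-width it selects, is left open.

```python
def pack_2bit(codes):
    """Pack per-channel codes (each in 0..3) into bytes, four per byte."""
    if any(c < 0 or c > 3 for c in codes):
        raise ValueError("each code must fit in 2 bits")
    out = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        out[i // 4] |= c << (2 * (i % 4))  # channel i uses bits 2*(i%4)..+1
    return bytes(out)


def unpack_2bit(packed, n):
    """Recover the first n 2-bit codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]
```

For a layer with 64 channels, the packed strategy occupies 16 bytes, whereas sending a single 32-bit value per channel would already take 256 bytes.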
QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads
Ahmad, Muhammad, Mazher, Khurram, Akram, Saqib, Tameem, Ahmad, Nasir, Saad Bin
We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It is capable of quantizing down to 3-bit resolutions with minimal loss in performance. The quantization strategies in QuantX take into account hardware-specific constraints to achieve efficient dequantization during inference, ensuring a flexible trade-off between runtime speed, memory requirement, and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LLaVA-v1.6 quantized down to 3 bits for multiple end-user tasks and outperforms recently published state-of-the-art quantization techniques. We further integrate one particular technique from QuantX into the popular llama.cpp framework and show its feasibility in terms of runtime compared to the mainstream quantization techniques from llama.cpp. Lastly, this manuscript provides insights into the LLM quantization process that motivated the range of recipes and options that are incorporated in QuantX.
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
Zhang, Jiaji, Sun, Ruichao, Zhao, Hailiang, Wu, Jiaju, Chen, Peng, Li, Hao, Liu, Yuying, Chow, Kingsum, Xiong, Gang, Deng, Shuiguang
Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
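To make the motivation for a dual-scale scheme concrete, the sketch below quantizes positive and negative values with separate scales, so a narrow negative range is not crushed by a wide positive one. This is one plausible reading of "dual-scale" and is not taken from the SegQuant paper; the function names and the symmetric 8-bit integer grid are assumptions.

```python
def dual_scale_quantize(xs, bits=8):
    """Quantize with one scale for non-negative values and another for
    negative values. Returns (codes, s_pos, s_neg)."""
    qmax = 2 ** (bits - 1) - 1
    pos_max = max((x for x in xs if x > 0), default=0.0)
    neg_max = max((-x for x in xs if x < 0), default=0.0)
    s_pos = pos_max / qmax if pos_max else 1.0
    s_neg = neg_max / qmax if neg_max else 1.0
    codes = [round(x / (s_pos if x >= 0 else s_neg)) for x in xs]
    return codes, s_pos, s_neg


def dual_scale_dequantize(codes, s_pos, s_neg):
    """Map integer codes back to floats using the sign-specific scale."""
    return [q * (s_pos if q >= 0 else s_neg) for q in codes]
```

With inputs such as `[-0.2, -0.1, 0.0, 5.0, 10.0]`, a single shared scale of 10/127 would leave only a handful of levels for the negative values, while the separate negative scale of 0.2/127 gives them the full grid.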
FedHQ: Hybrid Runtime Quantization for Federated Learning
Zheng, Zihao, Wang, Ziyao, Cui, Xiuping, Li, Maoliang, Chen, Jiayu, Liang, Yun, Li, Ang, Chen, Xiang
Federated Learning (FL) is a decentralized model training approach that preserves data privacy but struggles with low efficiency. Quantization, a powerful training optimization technique, has been widely explored for integration into FL. However, many studies fail to consider the distinct performance characteristics of particular quantization strategies, such as post-training quantization (PTQ) or quantization-aware training (QAT). As a result, existing FL quantization methods rely solely on either PTQ or QAT, optimizing for speed or accuracy while compromising the other. To efficiently accelerate FL and maintain distributed convergence accuracy across various FL settings, this paper proposes a hybrid quantization approach combining PTQ and QAT for FL systems. We conduct case studies to validate the effectiveness of using hybrid quantization in FL. To address the difficulty of modeling speed and accuracy caused by device and data heterogeneity, we propose a hardware-related analysis and a data-distribution-related analysis to help identify the trade-off boundaries for strategy selection. Based on these, we propose a novel framework named FedHQ that automatically adopts an optimal hybrid strategy allocation for FL systems. Specifically, FedHQ develops a coarse-grained global initialization and a fine-grained ML-based adjustment to ensure efficiency and robustness. Experiments show that FedHQ achieves up to 2.47x training acceleration and up to 11.15% accuracy improvement with negligible extra overhead.
A probabilistic framework for dynamic quantization
Santini, Gabriele, Paissan, Francesco, Farella, Elisabetta
We propose a probabilistic framework for dynamic quantization of neural networks that allows for a computationally efficient input-adaptive rescaling of the quantization parameters. Our framework applies a probabilistic model to the network's pre-activations through a lightweight surrogate, enabling the adaptive adjustment of the quantization parameters on a per-input basis without significant memory overhead. We validate our approach on a set of popular computer vision tasks and models, observing only a negligible loss in performance. Our method strikes the best trade-off between performance and computational overhead compared to standard quantization strategies.
Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models
Maisonnave, Lucas, Moineau, Cyril, Bichler, Olivier, Rastello, Fabrice
Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks. However, their size presents significant challenges for deployment and inference. This paper investigates the quantization of LLMs, focusing on the LLaMA architecture and its derivatives. We challenge existing assumptions about activation outliers in LLMs and propose a novel mixed-precision quantization approach tailored for LLaMA-like models. Our method leverages the observation that activation spikes in LLaMA architectures are predominantly concentrated in specific projection layers. By applying higher precision (FP16 or FP8) to these layers while quantizing the rest of the model to lower bit-widths, we achieve superior performance compared to existing quantization techniques. Experimental results on LLaMA2, LLaMA3, and Mistral models demonstrate significant improvements in perplexity and zero-shot accuracy, particularly for 8-bit per-tensor quantization. Our approach outperforms general-purpose methods designed to handle outliers across all architecture types, highlighting the benefits of architecture-specific quantization strategies. This research contributes to the ongoing efforts to make LLMs more efficient and deployable, potentially enabling their use in resource-constrained environments. Our findings emphasize the importance of considering model-specific characteristics in developing effective quantization pipelines for state-of-the-art language models by identifying and targeting a small number of projections that concentrate activation spikes.
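The selection step described in this abstract, keeping higher precision only where activation spikes concentrate, can be sketched as a simple per-layer rule. Everything specific here is an assumption for illustration: the max-to-median ratio as the spike measure, the threshold of 50, the layer names, and the precision labels are not taken from the paper.

```python
from statistics import median


def assign_precision(layer_activations, spike_ratio=50.0,
                     high="fp16", low="int8"):
    """Map each layer name to `high` if its largest activation magnitude
    exceeds `spike_ratio` times the median magnitude, else to `low`."""
    plan = {}
    for name, acts in layer_activations.items():
        mags = [abs(a) for a in acts]
        med = median(mags)
        if med == 0:
            spiky = max(mags) > 0  # any nonzero value dwarfs a zero median
        else:
            spiky = max(mags) / med > spike_ratio
        plan[name] = high if spiky else low
    return plan


# Hypothetical calibration samples for two projection layers.
samples = {
    "proj_a": [0.1, 0.2, 0.1, 900.0, 0.3],  # contains a large spike
    "proj_b": [0.1, 0.2, 0.3, 0.2, 0.1],    # no spike
}
plan = assign_precision(samples)
```

In this toy input only `proj_a` is kept in the higher-precision format, which mirrors the abstract's point that a small number of projections account for the spikes.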
MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness
Zheng, Zihao, Cui, Xiuping, Zheng, Size, Li, Maoliang, Chen, Jiayu, Liang, Yun, Chen, Xiang
However, the parameter density of LLMs has struggled to keep pace with the diverse and increasing volumes of data to be processed. To address this limitation, Mixture-of-Experts (MoE) has emerged as one of the most promising LLM implementation approaches [1]. An MoE model contains multiple "expert" networks, which consist of individual models or specialized layers, and each expert is trained to fit a different aspect of the data. When deployed in a particular inference scenario, the MoE dynamically selects a subset of these experts to be sparsely activated, allowing it to synthesize the corresponding data distribution [2-4]. Although MoE models demonstrate improved parameter scalability and memory efficiency through sparse activation, they still need parameter compression [5, 6]. As revealed by a large number of LLM compression studies, quantization has proven to be the most efficient compression method; it reduces model volume by refactoring parameters into low-precision numbers [7]. With the development of quantization techniques, the focus of methodology has gradually shifted from the parameters themselves to the mapping between the parameters and the complex data inputs. Some methods, such as GPTQ [8], leverage data distribution analysis to establish a data-parameter mapping that guides iterative channel-wise parameter quantization. Later methods further examine the relative data scale and its impact on data-parameter correlation and highlight the significant variation of parameters (e.g., SmoothQuant [9], AWQ [10]), thus achieving mixed-precision quantization with better performance (e.g., Atom [11]).